On Policy Learning in Restricted Policy Spaces
Authors
Abstract
We consider the problem of policy learning in a Markov Decision Process (MDP) where only a restricted subset of the full policy space can be used. An MDP consists of a state space S, a set of actions A, a transition probability function t(s, a, s′), a reward function R : S → ℝ, and a discount factor γ. The problem is to find a policy, i.e. a mapping from states to actions π : S → A, which achieves the highest discounted return 𝔼 ∑_{i=1}^{∞} γ^i R(s_i) (where s_i denotes the state encountered at time step i) for every possible start state. However, we are not interested in arbitrary policies, but only in a restricted subset Π of the full policy space. We assume that there exists a policy π ∈ Π which is best for every state s ∈ S compared to the other policies in Π; it is not required that the true optimal policy of the MDP belongs to Π. In some settings we also consider stochastic policies, which map states to a probability distribution over the action set. This greatly increases the size of the policy search space.
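As a concrete, purely illustrative sketch of this setting, the Python snippet below builds a small toy MDP, evaluates every policy in a hand-picked restricted set Π by iterative policy evaluation, and checks the assumption stated above that a single policy in Π is best at every start state. All names and numbers here (the toy transition matrix, evaluate_policy, the three candidate policies) are assumptions for illustration, not part of the paper.

```python
import numpy as np

# Toy MDP (illustrative assumption): 3 states, 2 actions.
# T[a][s, s'] plays the role of t(s, a, s'); R[s] is the reward of state s.
n_states, gamma = 3, 0.9
T = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]],  # action 1
])
R = np.array([0.0, 1.0, 2.0])

def evaluate_policy(pi, tol=1e-8):
    """Iterative policy evaluation for a deterministic policy pi: S -> A."""
    V = np.zeros(n_states)
    while True:
        V_new = R + gamma * np.array([T[pi[s], s] @ V for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Restricted policy space Pi: only these deterministic policies may be used.
Pi = [np.array([0, 0, 0]), np.array([1, 1, 1]), np.array([0, 1, 0])]
values = np.stack([evaluate_policy(pi) for pi in Pi])  # shape: (|Pi|, |S|)

# Check the assumption: one policy in Pi attains the best value at every state.
best = values.argmax(axis=0)
print("values per policy:\n", values)
print("a single policy dominates at every start state:", np.all(best == best[0]))
```

If the final check fails for a given Π, the dominance assumption made in the abstract simply does not hold for that particular restricted policy set.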
Similar Articles
Policy Capacity in the Learning Healthcare System; Comment on “Health Reform Requires Policy Capacity”
Pierre-Gerlier Forest and his colleagues make a strong argument for the need to expand policy capacity among healthcare actors. In this commentary, I develop an additional argument in support of Forest et al's view. Forest et al rightly point to the need to have embedded policy experts to successfully translate healthcare reform policy into healthcare change. Translation of externally generated i...
Policy Reuse for Transfer Learning Across Tasks with Different State and Action Spaces
Policy Reuse is a reinforcement learning method in which learned policies are saved and reused in similar tasks. The policy reuse learner extends its exploration to probabilistically include the exploitation of past policies, with the outcome of significantly improving its learning efficiency. In this paper we demonstrate that Policy Reuse can be applied for transfer learning among tasks in dif...
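As a rough sketch of the exploration scheme described above (a generic illustration under assumed names and parameters, not the authors' exact algorithm), action selection could mix exploitation of a saved past policy with ε-greedy behaviour on the new task's Q-values:

```python
import random

def reuse_action(state, past_policy, Q, actions, psi=0.5, epsilon=0.1):
    """Probabilistic policy reuse (illustrative): with probability psi follow the
    saved past policy; otherwise act epsilon-greedily on the new Q-values.
    Q is assumed to be a dict keyed by (state, action); psi and epsilon are
    assumed parameter names, not taken from the paper."""
    if random.random() < psi:
        return past_policy(state)                      # exploit the past policy
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit new knowledge
```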
Regularized Policy Iteration with Nonparametric Function Spaces
We study two regularization-based approximate policy iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. The core of these algorithms is the regularized extension of Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM),...
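As a minimal, generic illustration of a regularized LSTD solve (an ℓ2-penalized, linear-features sketch with assumed argument names, not the nonparametric REG-LSPI/REG-BRM formulation studied in that paper):

```python
import numpy as np

def regularized_lstd(phi, phi_next, rewards, gamma=0.99, lam=1e-2):
    """l2-regularized LSTD: solve (Phi^T (Phi - gamma * Phi') + lam * I) w = Phi^T r
    for the weights w of a linear value estimate V(s) ~ phi(s) @ w.
    (Generic sketch with assumed argument names.)"""
    A = phi.T @ (phi - gamma * phi_next) + lam * np.eye(phi.shape[1])
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

# Tiny usage example with random features (purely illustrative).
rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 5))        # features of visited states
phi_next = rng.normal(size=(100, 5))   # features of successor states
rewards = rng.normal(size=100)
print("value-function weights:", regularized_lstd(phi, phi_next, rewards))
```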
Probabilistic Policy Reuse for Inter-Task Transfer Learning
Policy Reuse is a reinforcement learning technique that efficiently learns a new policy by using similar policies learned in the past. The Policy Reuse learner improves its exploration by probabilistically including the exploitation of those past policies. Policy Reuse was introduced, and its effectiveness previously demonstrated, in problems with different reward functions in the same state and action ...
Continuous-Action Reinforcement Learning with Fast Policy Search and Adaptive Basis Function Selection
As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the community of artificial intelligence and machine learning. However, the generalization ability of RL is still an open problem and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In t...